MLOps Lifecycle and Production Monitoring Guide - Healthcare Client Focus

Tags: mlops, databricks, playbook, guide, healthcare

Author: Gary Fischer

Published: February 27, 2026


Healthcare Actuarial Models — Azure Databricks

Assumed Stack: Azure Databricks · Unity Catalog · MLflow · Delta Lake · dbt · Azure DevOps / Git · PySpark / Python / SQL  

Inference Pattern: Batch scoring (no real-time endpoints)  

Version: 1.0 — February 2026


Table of Contents

  1. Foundational Principles

  2. Roles & Responsibilities Matrix

  3. The MLOps Lifecycle — End to End

  4. Production Model Monitoring — Deep Dive

  5. Observability Architecture

  6. Drift Detection Framework

  7. Performance & Accuracy Monitoring

  8. Retraining Pipeline & Champion/Challenger

  9. Alerting, Escalation & Incident Response

  10. Healthcare-Specific Considerations

  11. Reference Architecture Diagram


1. Foundational Principles

The Databricks MLOps guidance (often referred to as the “Big Book of MLOps”) establishes a maturity model that moves teams from ad-hoc notebook experimentation toward fully governed, automated model lifecycles. For healthcare actuarial models — where predictions drive care decisions and financial projections — we target MLOps Maturity Level 3: automated pipelines, governed model promotion, continuous monitoring, and auditable lineage.

Five principles anchor every decision in this guide:

  1. Models are code. Model training logic, feature engineering, and scoring pipelines live in version-controlled repositories in Azure DevOps, not in manually-edited notebooks.

  2. Data is a first-class artifact. Every model depends on the quality and stability of its input data. dbt transformations and Delta Lake versioning give us reproducibility from raw claims through scored output.

  3. Environments are isolated. Development, Staging, and Production are separate Databricks workspaces (or at minimum, separate catalogs in Unity Catalog). Code promotes across environments; data does not leak between them.

  4. Promotion is gated, not automatic. No model reaches Production without passing validation checks, statistical comparison to the current champion, and human approval from the Actuarial team.

  5. Monitoring is not optional. A model without monitoring is a liability. Every model in Production has drift detection, performance tracking, and an alerting contract.


2. Roles & Responsibilities Matrix

The MLOps lifecycle is not a single team’s job. It requires coordinated handoffs between roles, each owning a specific slice of the pipeline. In a healthcare context, we add Actuarial SMEs and Compliance as first-class participants — they are not consulted at the end, they are embedded throughout.

2.1 Role Definitions

Data Engineer (DE)

Owns the data platform. Responsible for ingesting raw data (claims, eligibility, pharmacy, lab, EMR), building and maintaining dbt models that produce Bronze → Silver → Gold layers in Delta Lake, and ensuring data quality and freshness SLAs. The DE does not build ML models but is accountable for the data those models consume. In practice, the DE is the first person paged when a model monitoring alert fires, because the root cause is data more often than it is model logic.

Data Scientist (DS)

Owns the model. Responsible for exploratory analysis, feature engineering, model training, hyperparameter tuning, and validation. The DS works in the Development environment, using MLflow to track experiments. They write the training code that will eventually run as an automated pipeline in Production. The DS also owns the statistical validation that a retrained challenger model is equivalent to or better than the current champion. For actuarial models, the DS works hand-in-hand with the Actuarial SME to ensure clinical and financial validity.

MLOps / ML Engineer (MLE)

Owns the production pipeline. Responsible for converting the DS’s training notebooks into production-grade, parameterized jobs; building the scoring pipeline; implementing CI/CD in Azure DevOps; configuring monitoring and alerting; and operating the champion/challenger promotion workflow. The MLE is the bridge between “it works in my notebook” and “it runs reliably at 2 AM every Sunday.” They own the MLflow Model Registry lifecycle (None → Staging → Production → Archived) and the Databricks Workflows that orchestrate scoring.

Infrastructure / Platform Engineer (Infra)

Owns the Databricks platform, Azure networking, compute provisioning, and infrastructure-as-code (Terraform or Bicep for Azure resources, Databricks Asset Bundles for workspace objects). Responsible for cluster policies, instance pools, Unity Catalog metastore configuration, Azure Key Vault integration, and Private Link / VNet injection. They do not touch model code but ensure the platform is secure, cost-efficient, and reliable.

Security & Compliance (Sec)

Owns access control, PHI protection, and regulatory compliance. Responsible for Unity Catalog permissions (table-level, column-level masking for PHI), Azure AD group mappings, audit log configuration, and ensuring the platform meets HIPAA, state DOI, and CMS requirements. In healthcare, this role also reviews model monitoring dashboards for any inadvertent PHI exposure and validates that audit trails are complete.

Actuarial SME / Business Owner (ACT)

Owns the business logic and acceptance criteria. Responsible for defining model requirements, reviewing validation reports, approving model promotions to Production, and interpreting monitoring results in a clinical and financial context. The Actuarial team is the final approval gate — no model goes live without their sign-off. They also define the performance thresholds that trigger retraining alerts (e.g., “if the predictive ratio deviates more than 5% from 1.0, we need to investigate”).

2.2 RACI by Lifecycle Phase

| Phase | DE | DS | MLE | Infra | Sec | ACT |
|---|---|---|---|---|---|---|
| Data Ingestion & dbt Pipeline | A/R | C | C | C | C | I |
| Feature Engineering | C | A/R | C | I | C | C |
| Model Training & Experimentation | I | A/R | C | I | I | C |
| Model Validation & Testing | C | A/R | R | I | I | A |
| CI/CD Pipeline & Automation | C | C | A/R | C | C | I |
| Model Registry & Promotion | I | R | A/R | I | C | A |
| Scoring Pipeline (Production) | C | C | A/R | C | C | I |
| Monitoring & Drift Detection | C | R | A/R | C | I | C |
| Alerting & Incident Response | R | R | A/R | C | I | C |
| Retraining Decision | I | R | R | I | I | A |
| Security & Access Control | C | I | C | R | A/R | I |
| Regulatory Audit Support | C | C | C | C | A/R | R |

A = Accountable (final decision), R = Responsible (does the work), C = Consulted, I = Informed


3. The MLOps Lifecycle — End to End

This section walks through the complete lifecycle as it applies to the four distinct actuarial models being migrated into Databricks. Each phase is annotated with who does the work and what artifacts are produced.

3.1 Data Ingestion & Feature Store

Owner: Data Engineer  

Artifacts: dbt models, Delta tables in Unity Catalog, data quality reports

Raw data lands in ADLS Gen2 from upstream systems (claims adjudication, eligibility, pharmacy benefits, lab feeds, EMR extracts, vendor files). The DE builds dbt models (run on Databricks) that transform raw data through the medallion architecture:

  • Bronze: Raw ingestion, append-only, schema-on-read. Minimal transformation — just land it.

  • Silver: Cleaned, deduplicated, conformed. Claims are adjudicated, eligibility spans are resolved, member keys are unified. dbt tests enforce referential integrity and not-null constraints.

  • Gold: Model-ready feature tables. Aggregations at the member-month or member-quarter grain. Lookback windows applied (e.g., 12-month rolling claims for Risk Stratification, 6-month diagnosis history for Palliative Care). These Gold tables are the contract between DE and DS.

The DE registers Gold tables in Unity Catalog under a features schema. For models that share features (Risk Stratification and Concurrent Risk Score likely share claims-based features), the DE publishes shared feature tables to avoid duplication.

Data quality is enforced at every layer using dbt tests (unique, not_null, accepted_values, relationships) and optionally Great Expectations for statistical checks. Quality results are written to a data_quality.test_results Delta table for monitoring.
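These layered checks can be approximated with a lightweight pre-scoring gate. The sketch below is illustrative only (the function name, inputs, and thresholds are ours, not part of the stack): it validates volume, null rate, and freshness before a scoring run proceeds, returning a list of failure messages for the monitoring table.

```python
from datetime import date

def data_quality_gate(row_count: int, expected_min_rows: int,
                      null_rate: float, max_null_rate: float,
                      last_load_date: date, max_staleness_days: int,
                      as_of: date) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if row_count < expected_min_rows:
        failures.append(f"volume: {row_count} rows < expected minimum {expected_min_rows}")
    if null_rate > max_null_rate:
        failures.append(f"nulls: null rate {null_rate:.2%} exceeds {max_null_rate:.2%}")
    if (as_of - last_load_date).days > max_staleness_days:
        failures.append(f"freshness: last load {last_load_date} older than {max_staleness_days} days")
    return failures

# Example: a stale feature table trips only the freshness check
issues = data_quality_gate(
    row_count=1_250_000, expected_min_rows=1_000_000,
    null_rate=0.001, max_null_rate=0.01,
    last_load_date=date(2026, 2, 1), max_staleness_days=7,
    as_of=date(2026, 2, 15),
)
```

In a real pipeline the same failure messages would be written to the data_quality results table and, on any failure, the scoring Workflow would halt before loading the model.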

3.2 Experimentation & Model Development

Owner: Data Scientist  

Artifacts: MLflow experiments, training notebooks, feature importance analysis, validation notebooks

The DS works in the appropriate Databricks workspace (or catalog). They read from feature tables, explore data, engineer additional features, and train models. All experiments are logged to MLflow Tracking:

  • Parameters (hyperparameters, feature sets, lookback windows)

  • Metrics (AUC, precision, recall, calibration, Brier score, predictive ratio, R² on cost)

  • Artifacts (model binaries, feature importance plots, calibration curves, SHAP summaries)

  • Input data signature and example input (for schema enforcement downstream)

For the actuarial models, the DS pays special attention to:

  • Risk Stratification: Calibration is as important as discrimination. The model must not just rank members correctly but produce well-calibrated probability estimates that translate to expected cost.

  • Palliative Care: Sensitivity at high specificity — false negatives (missing a patient who would benefit from palliative care) are more costly than false positives. The DS should log precision-recall curves at multiple thresholds.

  • Concurrent Risk Score: This is a proprietary scoring methodology. The DS must document the algorithm specification thoroughly because this model may have regulatory or contractual IP implications. All coefficients, weights, and business rules must be version-controlled.

  • Impact Model: TBD

The DS does not deploy models. When satisfied with a candidate, they register it in the MLflow Model Registry with a None or Staging stage and create a pull request in Azure DevOps that triggers the CI/CD pipeline.

3.3 CI/CD Pipeline & Automated Validation

Owner: MLOps Engineer  

Artifacts: Azure DevOps pipeline YAML, test reports, validation notebooks, promotion records

The MLE builds and maintains the CI/CD pipeline in Azure DevOps. The pipeline is triggered by a pull request to the main branch (or a release/* branch) and executes the following stages:

CI Stage (triggered on PR):

  1. Lint and static analysis on all Python/PySpark code (ruff, mypy)

  2. Unit tests for feature engineering functions and scoring logic

  3. Integration test: run the training pipeline on a small sample dataset in the Staging workspace

  4. Validate the resulting model against the current Production champion:

   - Load the champion model from MLflow Registry (Production stage)

   - Load the challenger model from the CI run

   - Score both on a held-out validation dataset

   - Compare metrics (AUC, calibration, predictive ratio, etc.)

   - Compute Population Stability Index (PSI) between champion and challenger score distributions

   - Generate a validation report artifact

CD Stage (triggered on merge to main, after approval):

  1. Deploy training pipeline as a Databricks Workflow in the Production workspace

  2. Deploy scoring pipeline as a Databricks Workflow

  3. Deploy monitoring notebooks and alerting configuration

  4. Register the model in the Production MLflow Registry as Staging

  5. Do not promote to Production stage automatically — this requires Actuarial approval (see Section 3.4)

The pipeline YAML templates are designed to be reusable across all four models. Model-specific configuration (feature table paths, metric thresholds, scoring schedule) is parameterized in a model_config.yml file per model.

3.4 Model Promotion & Governance

Owner: MLOps Engineer (executes) · Actuarial SME (approves)  

Artifacts: Promotion request, validation report, approval record, model card

Model promotion follows a strict governance workflow:

  1. MLE generates a Promotion Request containing: validation report, metric comparison, drift analysis, data lineage, and a model card.

  2. DS reviews the statistical validity and signs off.

  3. ACT reviews the business validity — do the scores make clinical and financial sense? Are there any subpopulation biases? Is the model aligned with the actuarial filing?

  4. Sec confirms that no new PHI columns were introduced and access controls are correct.

  5. MLE transitions the model in MLflow Registry from Staging to Production and archives the previous champion.

This promotion is logged as an auditable event. In Unity Catalog and MLflow, the full lineage is preserved: which data version trained the model, which code version produced it, who approved it, and when.

3.5 Batch Scoring in Production

Owner: MLOps Engineer  

Artifacts: Scored Delta tables, scoring run logs, output delivery confirmations

Batch scoring is the steady-state operation for all four models. The MLE owns the Databricks Workflows that orchestrate scoring:

  1. Trigger: Scheduled (e.g., monthly for Risk Stratification, weekly for Palliative Care) or event-driven (new claims data loaded).

  2. Load model: The scoring notebook loads the Production stage model from MLflow Registry using mlflow.pyfunc.load_model("models:/<model_name>/Production").

  3. Load features: Read from the Gold feature tables. Validate row counts and data freshness before scoring.

  4. Score: Apply the model to produce predictions. For batch, this is a model.predict() call over a Spark DataFrame, leveraging mlflow.pyfunc.spark_udf() for distributed scoring.

  5. Post-process: Apply business rules, score banding, exclusion logic, and output formatting.

  6. Write output: Scored results written to a scored_output Delta table in the Gold layer, partitioned by scoring date.

  7. Deliver: Outputs pushed to downstream systems (care management platform, actuarial reporting, Power BI datasets) via ADF, JDBC, or file export.

  8. Log: Scoring metadata (model version, input row count, output row count, min/max/mean score, runtime, cluster ID) written to a scoring_runs Delta table.

Failure handling: If any step fails, the Workflow retries once, then alerts the MLE via PagerDuty/Teams. The scoring table is not updated with partial results — it is all-or-nothing per run.


4. Production Model Monitoring — Deep Dive

This is the most critical section of this guide. A model in Production without monitoring is a model accumulating silent risk. For healthcare actuarial models, silent degradation can mean misidentified high-risk members, missed palliative care candidates, or inaccurate financial projections.

Monitoring for batch prediction models differs fundamentally from real-time endpoint monitoring. There is no request latency to track, no throughput to measure, no endpoint availability to check. Instead, we monitor the statistical properties of predictions and inputs over time, and we do so on a per-scoring-run cadence.

4.1 What We Monitor (The Four Pillars)

| Pillar | What It Detects | Urgency | Primary Owner |
|---|---|---|---|
| Data Quality & Pipeline Health | Missing data, schema changes, volume anomalies, stale features, dbt test failures | Immediate — blocks scoring | Data Engineer |
| Feature Drift (Input Drift) | Shifts in the statistical distribution of model input features between training and scoring | Days to weeks — early warning | Data Scientist + MLE |
| Prediction Drift (Output Drift) | Shifts in the distribution of model predictions (scores) over time | Days to weeks — early warning | MLE + Actuarial |
| Performance Degradation (Concept Drift) | Decline in model accuracy/calibration when ground truth becomes available | Weeks to months — lagged but highest severity | Data Scientist + Actuarial |

These four pillars are layered intentionally. Data quality issues surface first (within hours of a bad data load). Feature drift surfaces next (within the current scoring cycle). Prediction drift surfaces alongside feature drift. Performance degradation surfaces last, because ground truth in healthcare is delayed — you don’t know if a Risk Stratification prediction was correct until claims mature, which can take 3-12 months.

4.2 Monitoring Architecture

The monitoring system is itself a set of Databricks Workflows and Delta tables. It is not a separate platform — it lives in the same workspace as the models it monitors, which keeps lineage intact and avoids data export.

[Figure: Production model monitoring architecture (Mermaid diagram — ProductionModel_Monitoring_Arch.png)]

5. Observability Architecture

Observability goes beyond monitoring. Monitoring tells you something is wrong. Observability tells you why it is wrong and where to look. For batch scoring models, observability means being able to trace any individual prediction back through the pipeline to the raw data that produced it.

5.1 The Three Pillars of ML Observability

Logging

Every scoring run produces structured logs written to the scoring_runs Delta table:

| Column | Type | Example |
|---|---|---|
| run_id | string (UUID) | a1b2c3d4-... |
| model_name | string | risk_stratification |
| model_version | int | 7 |
| mlflow_run_id | string | mlflow-run-abc123 |
| score_date | date | 2026-02-15 |
| input_row_count | long | 1,245,000 |
| output_row_count | long | 1,245,000 |
| score_mean | double | 0.342 |
| score_median | double | 0.287 |
| score_std | double | 0.198 |
| score_min | double | 0.001 |
| score_max | double | 0.997 |
| score_p10 | double | 0.089 |
| score_p90 | double | 0.621 |
| null_prediction_count | long | 0 |
| feature_table_version | long | 42 (Delta version) |
| cluster_id | string | 0215-143022-abc |
| runtime_seconds | int | 847 |
| status | string | SUCCESS |
| error_message | string | null |

This table is the single source of truth for “what happened.” The MLE builds it; the DS, ACT, and Sec consume it.
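As an illustration of the summary statistics the table captures, here is a minimal pure-Python sketch of a record builder. The function name and its exact signature are hypothetical; the field names mirror the columns above, and in production these values would be computed by Spark aggregations rather than in driver memory.

```python
import statistics
import uuid
from datetime import date

def build_scoring_run_record(model_name: str, model_version: int,
                             scores: list, score_date: date) -> dict:
    """Summarize one batch scoring run in the shape of the scoring_runs table."""
    clean = [s for s in scores if s is not None]
    qs = statistics.quantiles(clean, n=10)  # decile cut points: qs[0] ~ p10, qs[8] ~ p90
    return {
        "run_id": str(uuid.uuid4()),
        "model_name": model_name,
        "model_version": model_version,
        "score_date": score_date.isoformat(),
        "input_row_count": len(scores),
        "output_row_count": len(clean),
        "score_mean": statistics.fmean(clean),
        "score_median": statistics.median(clean),
        "score_std": statistics.stdev(clean),
        "score_min": min(clean),
        "score_max": max(clean),
        "score_p10": qs[0],
        "score_p90": qs[8],
        "null_prediction_count": len(scores) - len(clean),
        "status": "SUCCESS",
    }

record = build_scoring_run_record(
    "risk_stratification", 7,
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], date(2026, 2, 15))
```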

Metrics

Quantitative signals computed per scoring run and stored in the drift_metrics and performance_metrics Delta tables. Covered in detail in Sections 6 and 7.

Traces (Lineage)

Every prediction can be traced back:

  • Prediction → Scoring Run: The scored_output table includes run_id and model_version.

  • Scoring Run → Model: The run_id maps to an MLflow run, which records the model artifact, training data version, and code commit.

  • Model → Training Data: MLflow logs the Delta table version used for training via mlflow.log_input().

  • Training Data → Raw Source: dbt lineage graphs trace Gold features back through Silver and Bronze to raw ingestion.

This full lineage is critical for healthcare audit. When a regulator or auditor asks “why was this member scored as high-risk?”, you can trace the answer from the prediction all the way back to the claims data that drove it.

5.2 Observability by Role

| Role | What They Observe | Where They Look |
|---|---|---|
| DE | Data freshness, dbt test results, row counts, schema changes | dbt Cloud dashboard, data_quality.test_results table, ADF monitoring |
| DS | Feature distributions, model metrics, SHAP values, calibration | MLflow UI, monitoring dashboard, ad-hoc notebooks |
| MLE | Scoring run health, drift metrics, pipeline failures, alert history | Databricks Workflows UI, monitoring dashboard, PagerDuty |
| Infra | Cluster performance, job costs, workspace health, network issues | Azure Monitor, Databricks admin console, cost dashboards |
| Sec | Access logs, PHI access events, permission changes | Unity Catalog audit logs, Azure AD logs, SIEM |
| ACT | Score distributions, population shifts, business metric alignment | Monitoring dashboard (read-only), monthly model health report |

6. Drift Detection Framework

Drift detection is the early warning system. It does not tell you the model is wrong — it tells you the world the model was trained on may no longer match the world the model is scoring in. For healthcare, drift is common: member populations shift during open enrollment, coding practices change with ICD updates, pandemic-era utilization patterns normalize.

6.1 Feature Drift (Input Drift)

Feature drift measures whether the distribution of each input feature at scoring time has shifted from its distribution at training time. The reference distribution is computed once when the model is trained and stored as an artifact.

Method: Population Stability Index (PSI)

PSI is the standard metric for detecting distribution shifts in actuarial and credit risk modeling. It works for both continuous and categorical features.


PSI = Σ (Actual% - Expected%) × ln(Actual% / Expected%)

Where Actual% is the proportion of observations in each bin at scoring time, and Expected% is the proportion at training time. For continuous features, use 10 equal-frequency bins from the training distribution.
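A minimal implementation of this definition, using equal-frequency bins derived from the training distribution. The small proportion floor is a common convention to avoid log-of-zero when a bin is empty at scoring time; the exact floor value varies by shop, and this sketch is illustrative rather than the production monitoring code.

```python
import math
import statistics

def psi(expected: list, actual: list, n_bins: int = 10) -> float:
    """Population Stability Index with equal-frequency bins taken from the
    expected (training-time) distribution."""
    edges = statistics.quantiles(expected, n=n_bins)  # n_bins - 1 interior cut points

    def proportions(values: list) -> list:
        counts = [0] * n_bins
        for v in values:
            counts[sum(1 for e in edges if v > e)] += 1  # bin index 0..n_bins-1
        floor = 1e-4  # avoid log(0) for empty bins
        return [max(c / len(values), floor) for c in counts]

    exp_p = proportions(expected)
    act_p = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_p, act_p))

baseline = [i / 1000 for i in range(1000)]
assert psi(baseline, baseline) < 1e-6        # identical distributions: PSI ~ 0
shifted = [min(v + 0.3, 1.0) for v in baseline]
drifted_psi = psi(baseline, shifted)          # large upward shift: PSI well above 0.25
```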

| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.10 | No significant shift | None |
| 0.10 – 0.25 | Moderate shift, investigate | DS reviews; logged to dashboard |
| > 0.25 | Significant shift | Alert triggered; DS + ACT investigate |

Which features to monitor: Not all features. The MLE and DS jointly select the top 15-20 features by importance (from SHAP or model-native feature importance). Monitoring all 200+ features creates noise. Focus on the ones that actually drive predictions.
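Selecting that monitored subset is a one-liner once feature importances are available. The helper below is an illustrative sketch (the function name and the example feature names are ours); in practice the importance dict would come from SHAP values or the model's native feature importances logged to MLflow.

```python
def features_to_monitor(importance: dict, top_k: int = 15) -> list:
    """Pick the top-k features by absolute importance for drift monitoring."""
    ranked = sorted(importance, key=lambda f: abs(importance[f]), reverse=True)
    return ranked[:top_k]

# Hypothetical importances for a claims-based model
picked = features_to_monitor(
    {"rx_cost_12m": 0.31, "hcc_count": 0.22, "age": 0.10, "er_visits_6m": 0.05},
    top_k=2)
```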

Typical model-specific feature drift concerns for these healthcare models:

  • Risk Stratification: Monitor diagnosis code density (count of unique HCCs), pharmacy cost features, and inpatient admission counts. These shift during open enrollment and after CMS HCC model updates.

  • Palliative Care: Monitor ADL (Activities of Daily Living) scores, hospitalization frequency, and diagnosis severity markers. These shift as the population ages or as clinical documentation practices change.

  • Concurrent Risk Score: Monitor the concurrent claims features closely — this model is sensitive to claims maturity lag. If the scoring pipeline runs before claims are fully adjudicated, feature distributions will appear to shift when the real issue is data completeness.

  • Impact Model: Monitor intervention enrollment features and comparison group characteristics. Selection bias in who receives interventions can create apparent drift.

Implementation — who does what:

  1. DS identifies the features to monitor and computes the reference distributions during training. Stores them as a JSON artifact in MLflow.

  2. MLE builds the monitoring notebook that loads the reference distributions, computes PSI for each feature on the current scoring run’s input data, and writes results to drift_metrics.

  3. MLE configures alerting thresholds.

  4. DS investigates when alerts fire, determining whether the drift is real (population changed) or artifactual (data pipeline issue).

  5. DE investigates if the DS suspects a data pipeline issue.

  6. ACT is consulted if the drift is real, to determine whether retraining is needed.

6.2 Prediction Drift (Output Drift)

Prediction drift measures whether the distribution of model outputs (scores) has shifted from a baseline. The baseline can be either the training-time score distribution or a recent stable period.

Method: PSI on score distributions + summary statistic tracking

The MLE computes PSI on the score distribution (10 bins) and also tracks:

  • Mean score over time (trend detection)

  • Score decile boundaries over time (are thresholds shifting?)

  • Proportion of scores in each risk tier (for Risk Stratification: what % is High/Medium/Low?)

Prediction drift without feature drift is unusual and suggests a bug. Prediction drift with feature drift suggests a real population shift or concept drift. Feature drift without prediction drift means the model is robust to that particular shift — no action needed.

| Scenario | Feature Drift? | Prediction Drift? | Likely Cause | Action |
|---|---|---|---|---|
| A | No | No | Stable | None |
| B | Yes | No | Model is robust | Log and monitor |
| C | Yes | Yes | Population shift or concept drift | DS investigates; possible retrain |
| D | No | Yes | Bug in scoring pipeline | MLE investigates immediately |
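The scenario table can be encoded directly as a triage helper so alert payloads carry a consistent first diagnosis. This is a sketch (the function name and returned strings are ours), not a substitute for human investigation:

```python
def drift_triage(feature_drift: bool, prediction_drift: bool) -> tuple:
    """Map the feature/prediction drift combination to (likely cause, first action),
    following scenarios A-D above."""
    if not feature_drift and not prediction_drift:
        return ("stable", "none")
    if feature_drift and not prediction_drift:
        return ("model robust to shift", "log and monitor")
    if feature_drift and prediction_drift:
        return ("population shift or concept drift", "DS investigates; possible retrain")
    return ("suspected scoring pipeline bug", "MLE investigates immediately")

# Scenario D: output moved but inputs did not
cause, action = drift_triage(feature_drift=False, prediction_drift=True)
```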

6.3 Concept Drift

Concept drift means the relationship between features and the target has changed — the model’s learned patterns are no longer valid. This is detected through performance monitoring (Section 7), not through distribution checks. In healthcare, concept drift happens when:

  • CMS changes the HCC risk adjustment model

  • A new drug or treatment changes utilization patterns

  • A pandemic creates a temporary shift in healthcare behavior

  • State regulations change covered benefits

Concept drift is the most dangerous form of drift because it means the model is confidently wrong. It is also the hardest to detect quickly because it requires ground truth, which in healthcare is delayed.


7. Performance & Accuracy Monitoring

7.1 The Ground Truth Lag Problem

For healthcare models (especially actuarial models reliant on claims), ground truth does not arrive in real time. The delay depends on the model:

| Model | What “Ground Truth” Is | Typical Lag |
|---|---|---|
| Risk Stratification | Actual total cost of care for the member over the prediction period | 6-12 months (claims run-out) |
| Palliative Care | Whether the member was enrolled in palliative/hospice care, or died, within the prediction window | 3-6 months |
| Concurrent Risk Score | Actual concurrent period cost (after claims maturity) | 3-6 months (IBNR completion) |
| Impact Model | Measured ROI or clinical outcome of the intervention vs. comparison group | 6-12 months |

This lag means you cannot compute real-time accuracy. Instead, performance monitoring operates on a delayed, retrospective basis. The monitoring pipeline joins historical predictions with matured ground truth and computes performance metrics on a rolling basis.
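The first step of that retrospective join is simply filtering to runs whose ground truth has matured. A minimal sketch (field names follow the scoring_runs table; the function itself is illustrative):

```python
from datetime import date, timedelta

def matured_runs(runs: list, lag_days: int, as_of: date) -> list:
    """Select scoring runs scored at least `lag_days` before the evaluation
    date, i.e. runs whose ground truth has matured."""
    cutoff = as_of - timedelta(days=lag_days)
    return [r for r in runs if r["score_date"] <= cutoff]

runs = [
    {"run_id": "a", "score_date": date(2025, 6, 1)},
    {"run_id": "b", "score_date": date(2025, 12, 1)},
]
# With a 6-month (~183-day) claims run-out, only the June run is evaluable in February
evaluable = matured_runs(runs, lag_days=183, as_of=date(2026, 2, 15))
```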

7.2 Performance Metrics by Model

Risk Stratification:

| Metric | Description | Alert Threshold |
|---|---|---|
| AUC-ROC | Discrimination — can the model separate high-cost from low-cost members? | Drop > 0.03 from baseline |
| Predictive Ratio | Predicted cost / Actual cost, overall and by decile. Should be ~1.0. | Outside 0.90 – 1.10 |
| Calibration by Risk Tier | Mean predicted vs. actual cost for each risk tier (High/Med/Low) | Any tier off by > 15% |
| R² on Cost | Variance in actual cost explained by predicted cost | Drop > 0.05 from baseline |
| Decile Lift | Ratio of actual cost in top decile to average. Measures concentration. | Drop > 10% |
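The predictive ratio is straightforward to compute overall and by decile of predicted cost. A pure-Python sketch under that definition (the function name is ours; production code would do this with Spark window functions over the joined predictions/ground-truth table):

```python
def predictive_ratios(predicted: list, actual: list, n_bins: int = 10):
    """Overall and per-decile predictive ratio (predicted cost / actual cost),
    with deciles formed by ranking members on predicted cost."""
    overall = sum(predicted) / sum(actual)
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    size = len(order) // n_bins
    by_decile = []
    for d in range(n_bins):
        # last decile absorbs any remainder rows
        idx = order[d * size:(d + 1) * size] if d < n_bins - 1 else order[(n_bins - 1) * size:]
        by_decile.append(sum(predicted[i] for i in idx) / sum(actual[i] for i in idx))
    return overall, by_decile

# Toy example: a model that systematically over-predicts cost by 5%
actual_cost = [100.0 * (i + 1) for i in range(100)]
predicted_cost = [a * 1.05 for a in actual_cost]
overall, deciles = predictive_ratios(predicted_cost, actual_cost)
```

A uniform 5% over-prediction yields a ratio of 1.05 in every decile, just inside the 0.90-1.10 alert band; a ratio that is fine overall but skewed across deciles is the classic sign of miscalibration in the tails.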

Palliative Care:

| Metric | Description | Alert Threshold |
|---|---|---|
| AUC-ROC | Discrimination for palliative care eligibility | Drop > 0.03 |
| Sensitivity at 90% Specificity | How many true positives are we catching at a fixed false positive rate? | Drop > 0.05 |
| Positive Predictive Value at Operating Threshold | Of those flagged, how many truly needed palliative care? | Drop > 10% |
| Calibration Curve | Predicted probability vs. observed rate, across deciles | Visual deviation |
| Brier Score | Overall calibration + discrimination combined | Increase > 0.02 |
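"Sensitivity at 90% specificity" can be computed by choosing the score threshold from the negatives' empirical distribution. The sketch below uses one common convention for picking that threshold (no interpolation); libraries differ on the details, so treat it as illustrative:

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.90):
    """Sensitivity (recall on true positives) at a threshold chosen so that
    roughly `target_specificity` of negatives score at or below it."""
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    k = min(int(len(negatives) * target_specificity), len(negatives) - 1)
    threshold = negatives[k]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s > threshold for s in positives) / len(positives), threshold

# Toy data: 100 negatives spread over [0, 0.99], 4 true positives
neg_scores = [i / 100 for i in range(100)]
pos_scores = [0.90, 0.95, 0.97, 0.50]
sens, thr = sensitivity_at_specificity(neg_scores + pos_scores,
                                       [0] * 100 + [1] * 4, 0.90)
```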

Concurrent Risk Score:

| Metric | Description | Alert Threshold |
|---|---|---|
| Predictive Ratio | Predicted risk score / Actual concurrent cost, by score band | Outside 0.90 – 1.10 |
| R² on Cost | Explanatory power | Drop > 0.05 |
| Mean Absolute Error by Score Band | Accuracy within each score tier | Increase > 15% in any band |
| Population-Level Accuracy | Total predicted cost vs. total actual cost (aggregate calibration) | Off by > 3% |

Impact Model:

| Metric | Description | Alert Threshold |
|---|---|---|
| Estimated Treatment Effect Stability | Is the measured impact consistent over time? | Change > 20% from baseline |
| Covariate Balance | Are treatment and comparison groups still balanced on observables? | Standardized mean difference > 0.1 on any key covariate |
| Statistical Significance | Is the measured impact still statistically significant? | p-value crossing 0.05 |
| ROI Estimate Stability | Is the financial ROI estimate stable as more data matures? | Variance > 25% quarter-over-quarter |

7.3 Performance Monitoring Pipeline — Implementation

Who builds it: The MLE builds the pipeline infrastructure. The DS defines the metrics and validation logic. The ACT defines the alert thresholds.

The pipeline runs as a scheduled Databricks Workflow, typically monthly (aligned with claims maturity cycles):


Step 1: Identify scoring runs with matured ground truth
        (e.g., predictions from 6+ months ago where claims have run out)

Step 2: Join predictions with ground truth outcomes
        (scored_output JOIN claims_summary ON member_id, prediction_period)

Step 3: Compute performance metrics (AUC, calibration, predictive ratio, etc.)

Step 4: Write metrics to performance_metrics Delta table
        with columns: model_name, metric_name, metric_value, evaluation_date,
                      prediction_date_range, ground_truth_as_of_date, model_version

Step 5: Compare current metrics to baseline (training-time metrics stored in MLflow)

Step 6: If any metric crosses alert threshold → trigger alert

Step 7: Update monitoring dashboard
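Steps 5 and 6 reduce to a rule evaluation over the metrics tables. A minimal sketch, with our own rule encoding ("drop" for baseline-relative declines, "band" for acceptable ranges, mirroring the threshold tables above):

```python
def check_against_baseline(current: dict, baseline: dict, thresholds: dict) -> list:
    """Compare current metrics to training-time baselines. A ("drop", x) rule
    alerts when the metric fell by more than x; a ("band", lo, hi) rule alerts
    when the metric left [lo, hi]."""
    alerts = []
    for name, rule in thresholds.items():
        value = current[name]
        if rule[0] == "drop" and baseline[name] - value > rule[1]:
            alerts.append(f"{name}: dropped {baseline[name] - value:.3f} (> {rule[1]})")
        elif rule[0] == "band" and not (rule[1] <= value <= rule[2]):
            alerts.append(f"{name}: {value:.3f} outside [{rule[1]}, {rule[2]}]")
    return alerts

# Both the AUC drop and the predictive ratio excursion should alert here
alerts = check_against_baseline(
    current={"auc": 0.71, "predictive_ratio": 1.12},
    baseline={"auc": 0.76, "predictive_ratio": 1.0},
    thresholds={"auc": ("drop", 0.03), "predictive_ratio": ("band", 0.90, 1.10)},
)
```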

7.4 Subpopulation Monitoring

Aggregate metrics can mask subpopulation degradation. A model can have stable overall AUC while performing terribly for a specific subgroup. For healthcare actuarial models, this is both a performance issue and a fairness issue.

The DS and ACT jointly define subpopulations to monitor:

  • Age bands (pediatric, adult, 65+)

  • Line of business (Medicare, Medicaid, Commercial, ACA Exchange)

  • Chronic condition cohorts (diabetes, ESRD, behavioral health, oncology)

  • Geography (state, region, urban/rural)

  • New members vs. continuing members (new members have incomplete claims history)

Performance metrics are computed for each subpopulation. An alert fires if any subpopulation’s metric crosses its threshold, even if the overall metric is stable.

Who monitors subpopulations: The MLE builds the computation. The DS reviews the results. The ACT defines which subpopulations matter and what “fair” performance looks like. The Sec/Compliance team reviews for disparate impact.
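The subpopulation check itself is simple once per-group metrics exist. An illustrative sketch (group names and the drop threshold are hypothetical; a higher metric is assumed better):

```python
def subpopulation_alerts(metrics_by_group: dict, baseline: float,
                         max_drop: float) -> list:
    """Flag any subpopulation whose metric fell more than max_drop below the
    baseline, even when the overall metric looks stable."""
    return [group for group, value in metrics_by_group.items()
            if baseline - value > max_drop]

# Overall AUC is near baseline, but the new-member cohort has degraded
flagged = subpopulation_alerts(
    {"medicare": 0.76, "medicaid": 0.75, "new_members": 0.68},
    baseline=0.76, max_drop=0.03)
```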


8. Retraining Pipeline & Champion/Challenger

8.1 When to Retrain

Retraining is not scheduled on a fixed cadence by default. It is triggered by evidence:

| Trigger | Source | Who Decides |
|---|---|---|
| Feature drift PSI > 0.25 on multiple key features | Drift monitoring pipeline | DS recommends, ACT approves |
| Prediction drift PSI > 0.25 | Drift monitoring pipeline | DS recommends, ACT approves |
| Performance metric crosses alert threshold | Performance monitoring pipeline | DS recommends, ACT approves |
| External event (CMS HCC model update, ICD code revision, regulatory change) | Actuarial team / industry knowledge | ACT initiates, DS executes |
| Scheduled periodic refresh (if org policy requires it) | Calendar (e.g., annual for Risk Strat) | ACT mandates |

Note that in all cases, the Actuarial team has approval authority. This is a healthcare governance requirement — model changes can affect member care and financial projections.

8.2 The Retraining Workflow

[Figure: Retraining pipeline workflow (Mermaid diagram — RetrainingPipeline.png)]

8.3 Shadow Scoring

For high-risk models (Risk Stratification and Concurrent Risk Score directly affect financial projections), the MLE implements shadow scoring before full promotion:

  1. The Production scoring pipeline continues using the champion model for official output.

  2. A parallel pipeline scores the same input data with the challenger model and writes results to a shadow_scores Delta table.

  3. The DS and ACT compare champion and challenger outputs side-by-side for 1-2 scoring cycles.

  4. Only after shadow scoring confirms stability does the ACT approve promotion.

Shadow scoring adds cost (double compute) but dramatically reduces risk for models that drive actuarial filings or care management programs.
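One useful side-by-side view during shadow scoring is a tier migration matrix: how many members each (champion tier → challenger tier) pair contains. Large off-diagonal counts mean the challenger would reshuffle care management outreach even if aggregate metrics match. A minimal sketch (function name ours):

```python
def tier_migration(champion_tiers: list, challenger_tiers: list) -> dict:
    """Count members by (champion tier, challenger tier) pair, member-aligned."""
    counts = {}
    for a, b in zip(champion_tiers, challenger_tiers):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

# One member moves from High to Med under the challenger
migration = tier_migration(["High", "High", "Low", "Med"],
                           ["High", "Med", "Low", "Med"])
```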

8.4 Automated Retraining Pipeline

The MLE builds the retraining pipeline as a parameterized Databricks Workflow:

  1. Fetch latest training data from Gold feature tables (with a defined lookback window).

  2. Train using the same code that was validated in CI/CD (pulled from Git, not copy-pasted).

  3. Log everything to MLflow (parameters, metrics, artifacts, data version).

  4. Register the new model in MLflow Registry as a new version.

  5. Run automated validation against the current champion.

  6. Generate validation report and notify the DS and ACT.

The pipeline does not auto-promote. It prepares everything for human review, assuming gated "human in the loop" governance and process controls.
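The automated validation verdict in step 5 can be summarized with a helper like the one below. This is a simplified sketch: it assumes every compared metric is higher-is-better and that the score-distribution PSI is computed separately; the names and return strings are ours, and the final decision always remains with the DS and ACT.

```python
def challenger_recommendation(champion: dict, challenger: dict,
                              score_psi: float = 0.0,
                              max_score_psi: float = 0.25) -> str:
    """Summarize the automated champion/challenger comparison. Promotion still
    requires DS sign-off and Actuarial approval."""
    worse_on = [m for m in champion if challenger[m] < champion[m]]
    if worse_on:
        return f"reject: challenger worse on {', '.join(worse_on)}"
    if score_psi > max_score_psi:
        return "hold: score distribution shift needs review"
    return "recommend: route to DS and ACT for approval"

verdict = challenger_recommendation(
    champion={"auc": 0.74, "decile_lift": 3.1},
    challenger={"auc": 0.76, "decile_lift": 3.3},
    score_psi=0.08)
```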


9. Alerting, Escalation & Incident Response

9.1 Alert Tiers

| Tier | Severity | Example | Response Time | Who Is Notified |
| --- | --- | --- | --- | --- |
| P1 — Critical | Scoring pipeline failed, no output produced | Workflow failure, OOM error, data table missing | < 1 hour | MLE (paged), DE, DS |
| P2 — High | Scoring succeeded but output is suspect | Null predictions > 0, row count mismatch, extreme score distribution shift | < 4 hours | MLE, DS, ACT |
| P3 — Medium | Drift detected, investigation needed | Feature PSI > 0.25, prediction drift detected | < 1 business day | DS, MLE, ACT (informed) |
| P4 — Low | Performance metric degraded (lagged) | AUC declined by 0.02, predictive ratio shifted to 0.92 | < 1 week | DS, ACT, MLE (informed) |
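The tier-to-notification mapping above can be encoded directly so every alert carries its response window and recipients. A minimal sketch, assuming the tier table is the source of truth; the payload fields are illustrative, and per section 10.1 the payload references identifiers and metric values only, never member data.

```python
ALERT_TIERS = {
    "P1": {"response": "1 hour", "notify": ["MLE (paged)", "DE", "DS"]},
    "P2": {"response": "4 hours", "notify": ["MLE", "DS", "ACT"]},
    "P3": {"response": "1 business day", "notify": ["DS", "MLE", "ACT (informed)"]},
    "P4": {"response": "1 week", "notify": ["DS", "ACT", "MLE (informed)"]},
}

def route_alert(tier, model_name, description):
    """Build an alert payload from the tier table.
    References run/model identifiers only — never PHI."""
    if tier not in ALERT_TIERS:
        raise ValueError(f"unknown tier: {tier}")
    spec = ALERT_TIERS[tier]
    return {
        "tier": tier,
        "model": model_name,
        "description": description,
        "respond_within": spec["response"],
        "notify": list(spec["notify"]),
    }
```

Centralizing the mapping in one structure keeps the escalation path auditable: changing who gets paged is a reviewed code change, not an edit to a dashboard.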

9.2 Escalation Path


Alert fires
    │
    ▼
MLE triages (is it infra, data, or model?)
    │
    ├──▶ Infrastructure issue → Infra team (cluster, network, permissions)
    │
    ├──▶ Data issue → DE triages (stale data, schema change, dbt failure)
    │        │
    │        ▼
    │    DE fixes data pipeline → MLE re-runs scoring
    │
    └──▶ Model issue → DS investigates (drift, degradation, bug)
             │
             ├──▶ Minor: DS documents, adjusts thresholds, monitors
             │
             └──▶ Major: DS recommends retraining → ACT approves → retrain workflow

9.3 Incident Postmortem

Every P1 and P2 incident gets a postmortem within 5 business days. The MLE facilitates and documents:

  • Timeline of events

  • Root cause (5 Whys)

  • Impact (which downstream systems were affected, for how long)

  • Remediation (what was done to fix it)

  • Prevention (what changes prevent recurrence)

Postmortems are stored in the Central Repository (i.e., the Azure DevOps wiki) and linked to the relevant ADO work item. For healthcare models, postmortems also note whether any member care decisions were affected by the incident.


10. Healthcare-Specific Considerations

10.1 HIPAA and PHI in Monitoring

Monitoring dashboards and alert messages must not contain PHI. This means:

  • Drift metrics are computed on aggregate distributions, not individual members. Safe.

  • Performance metrics are computed on aggregates. Safe.

  • Scoring run logs contain summary statistics, not member-level data. Safe.

  • Alert messages reference run IDs and metric values, never member IDs. Safe.

  • Danger zone: Debugging a scoring failure may require inspecting individual records. This must happen in the Production workspace with appropriate access controls, and access must be logged via Unity Catalog audit.

Responsible role: Security & Compliance defines the rules. MLE builds the dashboards to comply. Infra configures audit logging.
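The "aggregate distributions, not individual members" rule is concrete in practice: drift metrics such as PSI can be computed entirely from pre-aggregated histogram bin counts, so no member-level rows ever leave the Production workspace. A minimal sketch of PSI over bin counts; the smoothing constant and bin layout are illustrative choices.

```python
import math

def psi_from_counts(reference_counts, current_counts, eps=1e-6):
    """Population Stability Index computed from aggregate histogram
    bins only — no member-level data is required."""
    if len(reference_counts) != len(current_counts):
        raise ValueError("bin layouts must match")
    ref_total = sum(reference_counts)
    cur_total = sum(current_counts)
    psi = 0.0
    for ref, cur in zip(reference_counts, current_counts):
        p = max(ref / ref_total, eps)  # reference share of this bin
        q = max(cur / cur_total, eps)  # current share of this bin
        psi += (q - p) * math.log(q / p)
    return psi
```

Identical distributions yield PSI of 0; the P3 alert threshold of 0.25 from section 9.1 corresponds to a substantial shift in bin shares.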

10.2 Regulatory Model Validation

Some of these models may be subject to regulatory review (e.g., CMS for Medicare Advantage risk adjustment, state DOI for rate filings). The monitoring system must support:

  • Audit trail: Every model version, training dataset version, validation report, and promotion decision is preserved and retrievable.

  • Reproducibility: Given a model version and a data version, any historical scoring run can be reproduced exactly. Delta Lake time travel and MLflow artifact storage make this possible.

  • Documentation: Model cards, validation reports, and monitoring summaries must be exportable for regulatory submission.

Responsible role: Actuarial SME prepares regulatory documentation. DS provides technical content. MLE ensures the infrastructure supports reproducibility. Sec/Compliance reviews before submission.

10.3 Claims Maturity and IBNR

Models such as the Concurrent Risk Score model are particularly sensitive to claims maturity. Incurred But Not Reported (IBNR) claims mean that recent claims data is incomplete. The monitoring pipeline must account for this:

  • Feature drift checks should compare against a training distribution that also had similar claims maturity (e.g., compare 3-month matured data against 3-month matured training data, not against fully matured training data).

  • Performance metrics should only be computed on fully matured periods (typically 6+ months of run-out).

  • The DE tags data with a claims_maturity_flag indicating completeness.
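The "fully matured periods only" rule for performance metrics can be sketched as a filter applied before any metric computation. The 6-month default follows the guidance above; the period record fields are illustrative stand-ins for whatever the DE's `claims_maturity_flag` tagging produces.

```python
from datetime import date

def matured_periods(periods, as_of, min_runout_months=6):
    """Keep only periods whose end date is at least min_runout_months
    before as_of, i.e., periods assumed to have full claims run-out."""
    def months_between(start, end):
        return (end.year - start.year) * 12 + (end.month - start.month)
    return [
        p for p in periods
        if months_between(p["period_end"], as_of) >= min_runout_months
    ]
```

Running performance metrics only over the output of this filter prevents the classic IBNR false alarm: a model looking "worse" simply because recent claims have not finished arriving.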

10.4 Annual HCC Model Updates

CMS updates the HCC risk adjustment model annually. When this happens:

  1. ACT notifies the team of the update and its implications.

  2. DE updates the HCC grouper logic in the dbt pipeline.

  3. DS evaluates whether the Risk Stratification and Concurrent Risk Score models need retraining (they almost certainly do).

  4. MLE coordinates the retraining and re-validation cycle.

  5. ACT approves the updated models before the effective date.

This is a planned event, not a monitoring-triggered event. It should be in the team’s annual calendar.


11. Reference Architecture Diagram

Figure: MLOps reference architecture (Mermaid chart — MLOps_Reference_Arch.png)

Appendix A: Key Delta Tables for Monitoring

| Table | Schema | Owner | Write Cadence |
| --- | --- | --- | --- |
| monitoring.scoring_runs | run_id, model_name, model_version, score_date, input_rows, output_rows, score_stats, status, runtime | MLE | Every scoring run |
| monitoring.drift_metrics | model_name, feature_name, metric_type (PSI/KL/JS), metric_value, scoring_date, reference_date, alert_triggered | MLE | Every scoring run |
| monitoring.performance_metrics | model_name, metric_name, metric_value, evaluation_date, prediction_period, ground_truth_as_of, model_version | MLE + DS | Monthly (lagged) |
| monitoring.alert_history | alert_id, model_name, tier, description, triggered_at, resolved_at, resolved_by, root_cause | MLE | On alert |
| monitoring.model_promotions | model_name, from_version, to_version, promoted_by, approved_by, promotion_date, validation_report_path | MLE | On promotion |
| data_quality.test_results | test_name, table_name, status, tested_at, failure_details | DE (dbt) | Every dbt run |
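A `monitoring.scoring_runs` record can be assembled from summary statistics alone, consistent with the PHI rules in section 10.1. A minimal sketch: the record is built in memory from a score list; in production it would be appended to the Delta table, and the specific stats and status values shown are illustrative.

```python
def build_scoring_run_record(run_id, model_name, model_version,
                             score_date, scores, input_rows):
    """Build one monitoring.scoring_runs row: summary stats only,
    no member-level data."""
    s = sorted(scores)
    n = len(s)
    return {
        "run_id": run_id,
        "model_name": model_name,
        "model_version": model_version,
        "score_date": score_date,
        "input_rows": input_rows,
        "output_rows": n,
        "score_stats": {
            "mean": sum(s) / n,
            "min": s[0],
            "max": s[-1],
            "median": s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2,
        },
        # Row count mismatch is a P2 condition per section 9.1.
        "status": "SUCCESS" if n == input_rows else "ROW_COUNT_MISMATCH",
    }
```

Writing this record on every scoring run gives the drift and alerting layers a stable, PHI-free substrate to query.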

This document is a living artifact. It should be reviewed quarterly by the MLOps Engineer, Data Scientist, and Actuarial SME, and updated as the platform matures, models evolve, and organizational practices change.